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Setting Matters Straight 


Relational Database Principles 


Climbing Trees in SQL 


Part I 





The Relational Treatment of Data Hierarchies 


By Fabian Pascal 


A: Usual, the discussion begins with a question — a good question that | see 


all too often: 


Can somebody help me with some 
advice on how to display records from 
a ‘tree’ structure table in a specified 


order? The table has three fields (see 
Figure 1). I need a SQL SELECT state- 
ment that will produce a bottom-to- 
top display ordered by country. 


Canada [210 

USA | 210 
Figure 1: The table in question. The 
fields describe a tree structure. 
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The real table has an unknown number 
of hierarchy levels, so I can’t use a fixed 
number of parent level calls. Is there a 
generic SELECT-based approach that 
provides a recursive process that applies 
a level number to each record and 
somehow includes that as part of a 


larger SELECT statement? 


Such queries are nothing new — they are 
an integral part of what is known in manu- 
facturing as bill of materials explosions. In 
fact, I wrote an article on the subject in the 
early 80s. But, as I explained then, 
although guidelines for relationally correct 
solutions were suggested early on, standard 


SQL still offers none. Consequently, users 
who need this capability continue to ask the 
above question again and again. (SQL3 has 
a feature called recursive union, but Date 
finds it extremely difficult to understand. It 
isnt supported by products anyway.) 


A few proprietary attempts by vendors to fill 
the gap in their SQL implementations have 
usually ignored both relational principles, as 
well as principles of good language design. 
Thus, even when users design databases 
properly, they end up having to resort to, at 
best, a highly procedural multi-statement 
solution, or, at worst, application code. Such 
solutions are complex, inaccessible to end- 
users, and suffer from poor or no DBMS 
optimization. The users blame, by implica- 
tion, the purportedly relational nature of 
SQL DBMSs for the deficiency, when in 
fact, the problem is due to the industry’s dis- 
regard for “all that theoretical stuff.” 


In this first installment of a two-part arti- 
cle series, I discuss the basic issues in rela- 
tional treatment of data hierarchies and 
illustrate the use of CONNECT BY, 
Oracle’s proprietary extension to SQL. 
Next month I will outline a relationally- 
correct solution, and show how CON- 


NECT BY fares relative to it. 


Hierarchies and Tables 

At issue are true hierarchies where every 
“child” node has exactly one “parent” node. 
They are called “tree structures” for obvi- 
ous reasons, and Figure 2 shows the one 
underlying the user's question. 
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Total 
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Figure 2: The information in Figure 1 is better depicted in a 
diagram. 


Older hierarchical DBMSs were targeted at such data 
structures (and thus, they had to impose a hierarchy even 
when none existed). In hierarchical databases, structural 
information — that is, the relationships between nodes — 
was represented at the physical level by pointer-based links 
that were exposed to users in applications. To access infor- 
mation and manipulate it, complex procedures had to be 
written that explicitly “navigated” the links, usually, one- 
record-at-a-time. Hierarchical databases were therefore out 
of the reach to all but the most technically proficient pro- 
grammers, and wreaked havoc with data independence. 


The relational model was devised to do away with such 
complications. By representing tree structures in tables, 
such that data can be manipulated /ogically at the set level, 
simpler access and DBMS optimization would be possible. 
But hierarchical and, more recently, object DBMS propo- 
nents, criticize RDBMSs for their purported inability to 
“handle” tree structures well. 


To understand the fallacy in this argument, it’s important to 
separate “handling” into structural and manipulative compo- 


nents and distinguish between relational and SQL DBMSs. 


Relational Representation 

There is nothing unique to the relational design of databas- 
es representing tree structures. With nodes perceived as the 
entity types of interest, the “links” between them are the 
relationships between the entities of those types. In the 
above case they are one-to-many (1:M) between parent and 
child entities, so a normalized design yields two tables, 


NODES and LINKS, as shown in Figure 3. 


Given the 1:M relationships and unique nodes, CODE 
and CCODE are the primary keys, respectively, and 
CCODE is also a foreign key in LINKS referencing 
CODE in NODES. 


Note very carefully that the tree structure can be determined 
from LINKS regardless of row order! In other words, the 
table obeys the relational discipline of inessential ordering — 
no meaningful information is embedded in row order. 
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Figure 3: Normalizing the design yields two tables. 


Thus, hierarchies such as part assemblies, organizational 
charts, or authorization graphs in DBMS catalogs can be 
readily represented relationally in tables. 

SOL Manipulation 
A common type of hierarchi- 
cal query is “Find the chil- 
dren of parent X.” In the 
previous user example, X is 
the root node of the tree 
(100) and the result, say 
TREE100, ought to contain 
all the nodes underneath it 
(see Figure 4). Such results 





“explode” the tree structure 


Figure 4: Nodes of the parent 
shown in Figure 3. 


under specific nodes. 


Ignore for the time being the PCODE column; we'll come 
to it later. If you try to devise a single standard SQL state- 
ment that produces a table with just the CCODEs, you 
will find, like our user, that there isn’t one. 


That’s why some vendors came up with proprietary exten- 
sions; for instance, Oracles CONNECT BY clause. For 


example, the statement: 


SELECT ccode 
FROM links 
CONNECT BY PRIOR ccode = pcode 
START WITH ccode = 100 


will produce the above result with just the CCODE column. 


The simplicity of this particular tree obscures the problem 
with this type of result. Consider the part assembly hierar- 
chy shown in Figure 5. There are many-to-many (M:M) 
relationships; the same part can occur multiple times in 
the tree, even at the same level. The equivalent to LINKS 
is PLINKS with {CCODE, PCODE} as its primary key 
(see Figure 6). 


Note: It may seem that because {P5,P3} and {P6,P3} rows 
appear only once in the table, some relationships are miss- 
ing. But a careful consideration of a// the rows in the table 
should help you realize that all relationships are represent- 
ed without duplicates. 
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Figure 5: A part assembly depicted as a tree diagram. 


This SELECT statement with a CONNECT BY clause: 


SELECT ccode 

FROM plinks 
CONNECT BY PRIOR ccode = pcode 
START WITH ccode = 'P1' 


yields the result shown in 
Figure 7, which contains 
duplicates. Note that the 
duplicates represent differ- 
ent nodes in the structure, 
which happen to be of the 
same type. This means that 
structural information — 
the location of those nodes 
in the tree — is “hidden” 
in the order of the rows. 
Reshuffle the rows or elimi- 
nate the duplicates and that 
information will be lost! 





Figure 6: The PLINKS table. 


Which is why CONNECT BY has been 
ae PITREE 
criticized by relational proponents on the 
grounds that it violates relational closure. 
The criticism is frequently dismissed by 


both vendors and users. For example: 


The result of a SQL statement ... is a 
table and can be used in another SQL 
statement ... The proprietary clause does 
have restrictions on its use with sub- 
queries and joins, but ... it does gener- 
ate a table and, thus, satisfies the defini- 
tion of closure. 


In other words, a table is a table is a table, 


Figure 7: This 
result contains 
duplicate rows. 


so what’s the fuss? 


The problem is, of course, that closure 

requires results to be not just any tables, but relational 
tables, that contain unordered, unique rows (of atomic 
values). It is this discipline that gives tables the mathe- 
matical properties of relations. Insistence on R-tables is 





not for purely theoretical reasons, but because practical 
benefits are predicated on those mathematical proper- 
ties. A relational DBMS must, therefore, enforce the 
discipline on all tables, including result tables, because 
as Date countered: — 


If ... essentially ordered tables are ... fed as input to 
another operation, then the information represented by 
the ordering of the rows will simply be lost. A simple 
illustration is provided by the operation DISTINCT, 
which will actually eliminate duplicates, thus quite def- 
initely causing information to be lost. (It will also 
destroy the relative sequence of the other rows, causing 
additional information to be lost as well.) 


So, first, reshuffling the rows — required by some opera- 
tions, or desirable for optimization — results in loss of 
essential information. And, second, this is certainly the 
case if the duplicates are eliminated (for example, by 
union to which, as we shall see, is pertinent to explosions). 


Stay Tuned 

Few of those criticizing the performance of SQL 
DBMSs realize that it’s SQL or the products’ relation- 
al violations — rather than adherence to the model — 
that cause performance problems. The need to pre- 
serve the duplicates and a specific order in results 
inhibits an optimizer’s ability to choose the best possi- 
ble execution procedure. 


For the correct solution — and to find out if CONNECT 
BY can produce it — stay tuned. El 
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